NISTIR  4776 


Training  Feed  Forward  Neural 
Networks  Using  Conjugate 
Gradients 


Patrick  J.  Grother 
James L. Blue


U.S. DEPARTMENT OF COMMERCE
Technology  Administration 
National  Institute  of  Standards 
and  Technology 
Advanced  Systems  Division 
Computer  Systems  Laboratory 
Gaithersburg,  MD  20899 


U.S.  DEPARTMENT  OF  COMMERCE 
Rockwell  A.  Schnabel,  Acting  Secretary 

NATIONAL  INSTITUTE  OF  STANDARDS 
AND  TECHNOLOGY 
John  W.  Lyons,  Director 


February  1992 



Training  feed-forward  neural  networks  using  conjugate  gradients 

James  L.  Blue  and  Patrick  J.  Grother 

National  Institute  of  Standards  and  Technology 
Gaithersburg,  MD  20899 


Abstract 

Neural networks for optical character recognition are still being trained using
back propagation, even though conjugate gradient methods have been shown
to be much faster. Most multilayer perceptron network training results in the
literature are obtained for small and unrealistic problems or from data sets that
are proprietary and not available for comparison testing.

We present results on a large realistic pattern set containing 2000 training and
1434 testing exemplars. Each pattern is composed of 32 Gabor coefficients
obtained from a 32 by 32 pixel binary image of a hand-written digit segmented
from the NIST Handwriting Image Data Base. These sets are believed to have
approximately 1% segmentation errors. Comparative results for Møller's scaled
conjugate gradient method and for standard back propagation are presented for
runs on a serial scientific workstation and a highly parallel computer.

Typical training on a network with 32 inputs, 32 hidden nodes, and 10 output
nodes gives 98% to 99% recognition for the training set and 95% for the test
set. Training with conjugate gradients requires fewer than 200 iterations; times
are about 20 to 40 minutes on a scientific workstation and 6 minutes on the
highly parallel computer. Testing (classification) is done at the rate of 600 and
1600 patterns per second on the scientific workstation and on the highly parallel
computer, respectively.

These results suggest that commercial hand-written character recognition
systems with great economic potential are feasible.

1.  Introduction 

Backpropagation (BP) has been used for some years to train feed-forward
networks. Mathematically, “training” means minimizing an error function, the sum
of the squares of the errors in the outputs. While useful for training networks,
BP has the disadvantages that convergence is slow and that there are, in the
usual implementation, two adjustable parameters, $\eta$ and $\alpha$, that have to be
determined. A third disadvantage, that convergence is often to a local minimum
of the error function rather than a global minimum, is inherent in the problem
and not restricted to BP.

Since BP corresponds (approximately) to using a steepest-descent method for
minimizing the error function, and steepest-descent methods are known to
converge slowly when near a minimum, slow convergence is not surprising.

Conjugate gradient (CG) methods have been used for many years for minimizing
functions, and have recently been discovered by the neural network community.
The usual CG methods require a line search or its equivalent. Møller has
introduced a scaled conjugate gradient (SCG) method; instead of a line search,
he uses an estimate of the second derivative along the search direction to find
the approximate minimum.

Many BP implementations for feed-forward networks are available, but CG and
SCG implementations have not been available. We have produced simple and
easy-to-use SCG and BP programs that share a common driver program. These
programs are available from the authors by electronic mail.

We also present a large and realistic test set of data for use with the SCG and
BP programs. The data are coefficients from a collection of handwritten digits.
The data set is also available from the authors.

1.1.  The  problem 

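We assume a standard multilayer perceptron network, fully connected, with $N^{(0)}$
inputs, $N^{(1)}$ hidden neurons, and $N^{(2)}$ output neurons. Define $\phi^{(m)}_{ip}$ to be the
activation of the $i$th element of the $m$th layer of the network, due to the
$p$th pattern of inputs, with $\phi^{(0)}_{ip}$ the input value for neuron $i$. Define the weight
matrix interconnecting the elements of neuron layer $m-1$ to layer $m$ to be
$w^{(m)}$, and let $f$ be some nonlinear squashing function, such as the standard
sigmoid function, $f(x) = 1/[1 + \exp(-x)]$. The input pattern vectors propagate
forward through successive layers of neurons according to

$$
a^{(1)}_{hp} = \sum_{i=0}^{N^{(0)}} w^{(1)}_{hi}\,\phi^{(0)}_{ip}, \qquad \phi^{(1)}_{hp} = f\bigl(a^{(1)}_{hp}\bigr);
$$

$$
a^{(2)}_{jp} = \sum_{h=0}^{N^{(1)}} w^{(2)}_{jh}\,\phi^{(1)}_{hp}, \qquad \phi^{(2)}_{jp} = f\bigl(a^{(2)}_{jp}\bigr); \tag{1}
$$

where $\phi^{(0)}_{0p} = 1$ and $\phi^{(1)}_{0p} = 1$, so that $w^{(1)}_{h0}$ and $w^{(2)}_{j0}$ are biases for the squashing
functions.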
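The forward pass for a single pattern can be sketched in a few lines of Python. This is only an illustration, not the authors' Fortran program: the function names `sigmoid` and `forward`, the nested-list weight layout `w1[h][i]` and `w2[j][h]`, and the convention that column 0 of each weight matrix holds the bias are assumptions made here.

```python
import math

def sigmoid(x):
    """Standard squashing function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(w1, w2, inputs):
    """Propagate one input pattern through a two-layer perceptron.

    w1[h][i] connects input i to hidden neuron h; w2[j][h] connects hidden
    neuron h to output neuron j. Column 0 of each matrix holds the bias,
    which multiplies a constant activation of 1 (assumed layout).
    """
    phi0 = [1.0] + list(inputs)                     # prepend the bias activation
    phi1 = [1.0] + [sigmoid(sum(w * a for w, a in zip(row, phi0))) for row in w1]
    phi2 = [sigmoid(sum(w * a for w, a in zip(row, phi1))) for row in w2]
    return phi1, phi2

# Tiny 2-2-1 network with all weights zero: every neuron outputs f(0) = 0.5.
w1 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w2 = [[0.0, 0.0, 0.0]]
hidden, outputs = forward(w1, w2, [0.3, -0.7])
print(hidden, outputs)
```

With zero weights the weighted sums vanish, so the printed activations are all 0.5 apart from the constant bias activation of 1.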

Footnote: Send electronic mail to jlbCaizure. cam. nist.gov; include your name,
institution, address, and electronic mail address.

Pattern $p$ has a desired output activation, or target value, of $\psi_{jp}$ at output
neuron $j$. For character classification, each pattern has exactly one $\psi_{jp}$ equal to
one and the rest equal to zero. Assume a set of $N^{(P)}$ patterns, a set to be used
for “training” the network. One measure of the goodness of a set of weights is

$$
\mathcal{E} = \frac{1}{2N^{(2)}N^{(P)}} \sum_{p=1}^{N^{(P)}} \sum_{j=1}^{N^{(2)}} \bigl[\psi_{jp} - \phi^{(2)}_{jp}\bigr]^2. \tag{2}
$$

The quantity $\sqrt{2\mathcal{E}}$, the root-mean-square (RMS) error, is a measure of the output
activation error averaged over all patterns and all output neurons. Minimizing
$\mathcal{E}$ provides one way of determining the weight matrices $w^{(1)}$ and $w^{(2)}$; it is a
substitute for what is really desired, to determine $w^{(1)}$ and $w^{(2)}$ so as to do the
best possible character identification over some large and unknown set of
“testing” input patterns.
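As a small numerical illustration of the error measure, the sketch below evaluates the mean-square error and the corresponding RMS value for a network that outputs zero everywhere. It assumes the error is normalized by twice the number of outputs times the number of patterns, so that the RMS error is the square root of twice the error; the function name `mean_square_error` is hypothetical.

```python
import math

def mean_square_error(targets, outputs):
    """E = sum over patterns and outputs of (target - output)^2 / (2 * N2 * NP)."""
    n_patterns = len(targets)
    n_outputs = len(targets[0])
    total = sum((t - o) ** 2
                for t_row, o_row in zip(targets, outputs)
                for t, o in zip(t_row, o_row))
    return total / (2.0 * n_outputs * n_patterns)

# Two patterns, ten classes, one-of-ten targets; the network outputs all zeros.
targets = [[1.0 if j == c else 0.0 for j in range(10)] for c in (3, 7)]
outputs = [[0.0] * 10 for _ in targets]
e = mean_square_error(targets, outputs)
rms = math.sqrt(2.0 * e)
print(e, rms)
```

Each pattern contributes exactly one unit of squared error here, so the mean square over ten outputs is 0.1 and the RMS error is its square root.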


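The negative gradients of $\mathcal{E}$ with respect to the weights are

$$
g^{(2)}_{jh} = -\frac{\partial \mathcal{E}}{\partial w^{(2)}_{jh}} \quad\text{and}\quad g^{(1)}_{hi} = -\frac{\partial \mathcal{E}}{\partial w^{(1)}_{hi}}. \tag{3}
$$

Define auxiliary variables $\delta^{(2)}_{jp}$ and $\delta^{(1)}_{hp}$:

$$
\delta^{(2)}_{jp} = \bigl(\psi_{jp} - \phi^{(2)}_{jp}\bigr)\, f'\bigl(a^{(2)}_{jp}\bigr) \quad\text{and}\quad \delta^{(1)}_{hp} = \sum_{j=1}^{N^{(2)}} \delta^{(2)}_{jp}\, w^{(2)}_{jh}\, f'\bigl(a^{(1)}_{hp}\bigr), \tag{4}
$$

where $a^{(1)}_{hp}$ and $a^{(2)}_{jp}$ are the weighted sums that form the arguments of $f$ in
Equation 1, and $f'(x)$ is the derivative of $f$; for the usual sigmoid function,
$f'(x) = f(x)[1 - f(x)]$. Then the negative gradients are

$$
g^{(2)}_{jh} = \frac{1}{N^{(2)}N^{(P)}} \sum_{p=1}^{N^{(P)}} \delta^{(2)}_{jp}\,\phi^{(1)}_{hp} \quad\text{and}\quad g^{(1)}_{hi} = \frac{1}{N^{(2)}N^{(P)}} \sum_{p=1}^{N^{(P)}} \delta^{(1)}_{hp}\,\phi^{(0)}_{ip}. \tag{5}
$$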


To simplify the notation in later sections, let $\mathbf{w}$ be a vector of length
$N^{(w)} = [N^{(0)} + 1]N^{(1)} + [N^{(1)} + 1]N^{(2)}$, formed by concatenating all the $w^{(1)}_{hi}$
and $w^{(2)}_{jh}$ in some order. Similarly, $\mathbf{g}$ is the vector of negative gradients of $\mathcal{E}$
with respect to $\mathbf{w}$, formed by concatenating all the $g^{(1)}_{hi}$ and $g^{(2)}_{jh}$ in the
same order.
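The delta-based gradient can be checked against a finite difference of the error. The following sketch is an illustration under stated assumptions, not the authors' code: it assumes a sigmoid squashing function, an error normalized by twice the number of outputs times the number of patterns, and nested-list weights `w1[h][i]` and `w2[j][h]` with the bias in column 0. The name `error_and_gradient` is hypothetical.

```python
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

def error_and_gradient(w1, w2, patterns, targets):
    """Return the error and the negative gradients (g1, g2) of a two-layer net."""
    n_p, n_out = len(patterns), len(w2)
    scale = 1.0 / (n_out * n_p)
    g1 = [[0.0] * len(w1[0]) for _ in w1]
    g2 = [[0.0] * len(w2[0]) for _ in w2]
    err = 0.0
    for x, psi in zip(patterns, targets):
        phi0 = [1.0] + list(x)
        phi1 = [1.0] + [f(sum(w * a for w, a in zip(row, phi0))) for row in w1]
        phi2 = [f(sum(w * a for w, a in zip(row, phi1))) for row in w2]
        err += sum((t - o) ** 2 for t, o in zip(psi, phi2))
        # Output deltas: (target - output) * f'(a), using f' = f (1 - f).
        d2 = [(t - o) * o * (1.0 - o) for t, o in zip(psi, phi2)]
        # Hidden deltas: pass d2 back through w2 (the bias unit h = 0 has none).
        d1 = [sum(d2[j] * w2[j][h] for j in range(n_out)) * phi1[h] * (1.0 - phi1[h])
              for h in range(1, len(phi1))]
        for j in range(n_out):
            for h in range(len(phi1)):
                g2[j][h] += scale * d2[j] * phi1[h]
        for h in range(len(w1)):
            for i in range(len(phi0)):
                g1[h][i] += scale * d1[h] * phi0[i]
    return 0.5 * scale * err, g1, g2

w1 = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
w2 = [[0.2, -0.3, 0.5]]
patterns, targets = [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]]
e0, g1, g2 = error_and_gradient(w1, w2, patterns, targets)
expected_g = g1[1][2]
eps = 1e-6
w1[1][2] += eps                                  # perturb one weight
e1, _, _ = error_and_gradient(w1, w2, patterns, targets)
numeric = -(e1 - e0) / eps                       # negative finite-difference slope
print(abs(numeric - expected_g) < 1e-6)
```

A check of this kind is a cheap way to catch index or sign mistakes before running a long training job.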


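The standard “network training” problem then reduces to finding a set of weights,
$\mathbf{w}$, to minimize the mean-square activation error, $\mathcal{E}$. An alternative minimization
problem is a regularized version, sometimes known as the Levenberg-Marquardt
method and sometimes known as ridge regression: minimize a linear combination
of the mean-square activation error and the mean-square weight:

$$
\mathcal{E}(\mathbf{w};\mu) = \mathcal{E} + \frac{\mu}{2N^{(w)}} \left[ \sum_{i=0}^{N^{(0)}} \sum_{h=1}^{N^{(1)}} \bigl(w^{(1)}_{hi}\bigr)^2 + \sum_{h=0}^{N^{(1)}} \sum_{j=1}^{N^{(2)}} \bigl(w^{(2)}_{jh}\bigr)^2 \right], \tag{6}
$$

where $N^{(w)}$ is the total number of weights. There is an additional parameter, $\mu$,
which must be chosen somehow. Since $\mathcal{E} = \mathcal{E}(\mathbf{w}; 0)$, the remainder of this paper
will use $\mathcal{E}(\mathbf{w};\mu)$. The negative gradients become

$$
g^{(2)}_{jh} = \frac{1}{N^{(2)}N^{(P)}} \sum_{p=1}^{N^{(P)}} \delta^{(2)}_{jp}\,\phi^{(1)}_{hp} - \frac{\mu}{N^{(w)}}\, w^{(2)}_{jh},
$$

$$
g^{(1)}_{hi} = \frac{1}{N^{(2)}N^{(P)}} \sum_{p=1}^{N^{(P)}} \delta^{(1)}_{hp}\,\phi^{(0)}_{ip} - \frac{\mu}{N^{(w)}}\, w^{(1)}_{hi}. \tag{7}
$$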
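The effect of the regularization term on the error and on the negative gradient can be illustrated on a flat weight vector. The sketch assumes that the penalty is the regularization parameter times the mean-square weight, divided by two, so that its contribution to the negative gradient is minus the parameter times each weight divided by the number of weights. The helper names `weight_count` and `regularize` are hypothetical.

```python
def weight_count(n0, n1, n2):
    """Number of weights, biases included, in an n0-n1-n2 network."""
    return (n0 + 1) * n1 + (n1 + 1) * n2

def regularize(err, neg_grad, weights, mu):
    """Add the mean-square-weight penalty to an error and its negative gradient."""
    n_w = len(weights)
    penalty = mu / (2.0 * n_w) * sum(w * w for w in weights)
    reg_grad = [g - mu / n_w * w for g, w in zip(neg_grad, weights)]
    return err + penalty, reg_grad

n_w = weight_count(32, 32, 10)        # the 32-32-10 network of the abstract
e_reg, g_reg = regularize(0.25, [0.1, -0.2], [1.0, -3.0], mu=0.5)
print(n_w, e_reg, g_reg)
```

The penalty pulls every weight toward zero in proportion to its size, which is why it is also described as ridge regression.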

Minimization of a function of many variables is a mature field; there are
numerous algorithms and more numerous variations of them. All are iterative,
generating a sequence of sets of weights. For neural network applications in
character recognition, where typical values of $N^{(w)}$ might range from a few
hundred to many thousand, computationally feasible algorithms can use the
gradient vector but not the Hessian matrix (the matrix of second derivatives of
the function). The typical algorithm has the structure

1. Choose an initial vector $\mathbf{w}_0$ and set $k = 0$. In the absence of a better idea,
random starting values are used.

2. Calculate the function, $\mathcal{E}(\mathbf{w}_k;\mu)$, and its negative gradient, $\mathbf{g}_k$.

3. Calculate a step, $\Delta\mathbf{w}_k$. Let $\mathbf{w}_{k+1} = \mathbf{w}_k + \Delta\mathbf{w}_k$.

4. Calculate the function, $\mathcal{E}(\mathbf{w}_{k+1};\mu)$, and its negative gradient, $\mathbf{g}_{k+1}$.

5. If $\mathbf{w}_{k+1}$ is unsatisfactory, make some adjustments, discard $\mathbf{w}_{k+1}$, and
go to 3. Otherwise, increase $k$ by 1.

6. If the iteration has converged, or if too many iterations have been done,
or if progress seems hopeless, quit.

7. Go to 3.

This generalized procedure, with suitable elaborations of steps 3, 5, and 6, always
succeeds in finding a local minimum of the function. There is no practical way
to find the absolute minimum, and $\mathcal{E}(\mathbf{w};\mu)$ has many local minima. Essentially
all the computer time is spent in step 4.
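The numbered procedure can be sketched as a generic descent loop. The version below is a minimal illustration only: it takes the step along the negative gradient, and its "adjustment" for an unsatisfactory step is simply to halve the step length, which is an assumption made here and not a rule from the paper. The function name `minimize` and the test function are hypothetical.

```python
def minimize(func, neg_grad, w, step=1.0, max_iter=500, tol=1e-10):
    """Generic descent loop: propose a step, reject it if the function rises.

    Returns (weights, function value, iterations used).
    """
    value = func(w)                                   # step 2
    for k in range(max_iter):                         # step 6: iteration cap
        g = neg_grad(w)
        if sum(gi * gi for gi in g) <= tol:           # step 6: converged
            return w, value, k
        accepted = False
        while step > 1e-16:                           # steps 3 to 5
            trial = [wi + step * gi for wi, gi in zip(w, g)]
            trial_value = func(trial)                 # step 4
            if trial_value < value:
                accepted = True
                break
            step *= 0.5                               # adjust, discard, retry
        if not accepted:
            return w, value, k                        # progress seems hopeless
        w, value = trial, trial_value
    return w, value, max_iter

# Quadratic bowl with its minimum at (1, -2).
func = lambda w: (w[0] - 1.0) ** 2 + 10.0 * (w[1] + 2.0) ** 2
neg_grad = lambda w: [-2.0 * (w[0] - 1.0), -20.0 * (w[1] + 2.0)]
w_min, f_min, iterations = minimize(func, neg_grad, [0.0, 0.0])
print(w_min, f_min, iterations)
```

BP and SCG differ only in how the step is chosen inside this kind of loop; the function and gradient evaluation dominates the cost in every case.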

1.2. Backpropagation minimization of $\mathcal{E}(\mathbf{w};\mu)$

For a given $\mathbf{w}$, $\mathbf{g}$ is the direction of steepest descent for $\mathcal{E}(\mathbf{w};\mu)$, the direction
in which $\mathcal{E}(\mathbf{w};\mu)$ decreases most rapidly. Many variations are possible. One
possibility is

$$
\Delta\mathbf{w}_k = \eta_k\,\mathbf{g}_k \tag{8}
$$

with $\eta_k$ determined by a line search, approximately minimizing the function with
respect to $\eta_k$. The standard backpropagation method uses

$$
\Delta\mathbf{w}_k = \eta\,\mathbf{g}_k + \alpha\,\Delta\mathbf{w}_{k-1} \tag{9}
$$

with fixed parameters $\eta$ and $\alpha$, not determined by the algorithm. For large
problems, (9) works well for a few iterations, then slows down drastically. An
example is given in Section 4.
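The fixed-parameter update with a momentum term can be tried on a small ill-conditioned quadratic, where the slow direction dominates the run time. This is a toy sketch: the function name `backprop_momentum`, the test function, and the parameter values are illustrative assumptions and are unrelated to the values used for the character data.

```python
def backprop_momentum(neg_grad, w, eta, alpha, iterations):
    """Iterate dw_k = eta * g_k + alpha * dw_{k-1} with fixed eta and alpha."""
    dw = [0.0] * len(w)
    for _ in range(iterations):
        g = neg_grad(w)
        dw = [eta * gi + alpha * di for gi, di in zip(g, dw)]
        w = [wi + di for wi, di in zip(w, dw)]
    return w

# F(w) = 0.5 * (w0^2 + 100 * w1^2): curvature differs by a factor of 100.
neg_grad = lambda w: [-w[0], -100.0 * w[1]]
with_momentum = backprop_momentum(neg_grad, [1.0, 1.0], eta=0.015, alpha=0.5, iterations=200)
without_momentum = backprop_momentum(neg_grad, [1.0, 1.0], eta=0.015, alpha=0.0, iterations=200)
print(with_momentum, without_momentum)
```

The step length is limited by the steep direction, so the shallow direction converges slowly; the momentum term helps but does not remove the problem.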

1.3. Conjugate gradient minimization of $\mathcal{E}(\mathbf{w};\mu)$

For many years, conjugate gradient algorithms have been used to minimize
functions, and good Fortran programs have been available. More recently,
the neural network community has discovered conjugate gradients. The step
used is

$$
\Delta\mathbf{w}_k = \alpha_k(\mathbf{g}_k + \beta_k\,\Delta\mathbf{w}_{k-1}) \tag{10}
$$

where $\beta_k$ is calculated by the algorithm to make $\Delta\mathbf{w}_k$ and $\Delta\mathbf{w}_{k-1}$ conjugate,
or orthogonal in a generalized meaning of the word. The factor $\alpha_k$ is often
determined by some kind of line search, using about three function calls.

Møller uses a (temporary) small value for $\alpha_k$, does a function evaluation in
order to approximate the second derivative in the search direction, then uses
the approximate second derivative to select a final $\alpha_k$; this uses two function
calls. (If the new weights result in an increased error, they are rejected and
some parameters are adjusted until the new weights reduce the error. In this
case, a successful iteration takes more than two function calls.) Møller calls
his method Scaled Conjugate Gradients (SCG). Our experiments on character
recognition problems showed SCG to be preferable to the program of Shanno
and Phua.
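The idea of replacing the line search by a second-derivative estimate along the search direction can be sketched as follows. This is a simplified illustration, not Møller's SCG: it omits the rejection of steps that increase the error and the parameter adjustments mentioned above, and it uses the Fletcher-Reeves formula for the conjugacy factor, which is an assumption made here. The search direction `d` differs from the previous step only by the previous step length, so the update has the same form as the conjugate gradient step above. The function name and test problem are hypothetical.

```python
def cg_second_derivative(neg_grad, w, iterations, sigma=1e-4):
    """Conjugate-gradient steps whose length comes from a curvature estimate.

    The second derivative along direction d is approximated by a finite
    difference of gradients, so each iteration costs two gradient calls.
    """
    r = neg_grad(w)                       # negative gradient at w
    d = list(r)                           # first direction: steepest descent
    for _ in range(iterations):
        rr = sum(ri * ri for ri in r)
        if rr < 1e-24:
            break
        probe = [wi + sigma * di for wi, di in zip(w, d)]
        r_probe = neg_grad(probe)
        # d . H d is approximately d . (grad(probe) - grad(w)) / sigma,
        # and grad = -neg_grad, hence the (r - r_probe) below.
        curvature = sum(di * (ri - pi) for di, ri, pi in zip(d, r, r_probe)) / sigma
        if curvature <= 0.0:
            break                         # safeguards for this case are omitted
        alpha = sum(di * ri for di, ri in zip(d, r)) / curvature
        w = [wi + alpha * di for wi, di in zip(w, d)]
        r_new = neg_grad(w)
        beta = sum(ri * ri for ri in r_new) / rr      # Fletcher-Reeves (assumed)
        d = [ri + beta * di for ri, di in zip(r_new, d)]
        r = r_new
    return w

# F(w) = 0.5 w.A.w - b.w with A = [[4, 1], [1, 3]] and b = [1, 2].
neg_grad = lambda w: [1.0 - (4.0 * w[0] + w[1]), 2.0 - (w[0] + 3.0 * w[1])]
w_cg = cg_second_derivative(neg_grad, [0.0, 0.0], iterations=5)
print(w_cg)
```

On a quadratic the curvature estimate is exact up to rounding, so the two-variable example reaches its minimum at (1/11, 7/11) after two iterations.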


2.  Programs 

Fortran programs to train neural networks using BP and SCG have been
written; the BP and SCG programs differ only in the subprogram called to minimize
$\mathcal{E}(\mathbf{w};\mu)$. The implementations for serial computers are reasonably
straightforward; the implementation for parallel computers will be discussed in
Section 2.1.

Deciding when to stop an iterative procedure is not straightforward. In
principle, stopping is simple: stop when a local minimum has been reached. At a
local minimum, $\mathbf{g} = 0$. (The same is true at a saddle point or a local
maximum, but checking for these requires computing the Hessian matrix, which is
impractical.) However, in finite-precision arithmetic $\mathbf{g} = 0$ is attained only in
unusual circumstances. The programs have five stopping criteria built in:


1. $k \ge C_1$  (Too many iterations)

2. $\mathcal{E}(\mathbf{w}_k;\mu) \le C_2$  (Achieved error goal)

3. $|\mathbf{g}_k| \le C_3\,|\mathbf{w}_k|$  (Achieved gradient goal)

4. $\mathcal{E}(\mathbf{w}_k;\mu) \ge (1 - C_4)\,\mathcal{E}(\mathbf{w}_{k-K};\mu)$  (Error decreasing too slowly)

5. $R_k \ge R_{k-K} - C_5$  (Right classifications decreasing too slowly)


where $C_1$, $C_2$, $C_3$, $C_4$, $C_5$, and $K$ are constants supplied by the user. (Criteria 4
and 5 are checked only every $K$ iterations.) The results obtained depend
significantly on which criterion applies in the particular run, which in turn depends
on the constants. Some examples are given in Section 4.
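A stopping test along these lines might be organized as below. This is a sketch under assumptions, not the logic of the Fortran programs: the function name `stopping_reason` and its argument layout are hypothetical, and criterion 5 is interpreted here as "the number of correctly classified training patterns has grown by no more than C5 over the last K iterations", which matches the way the criterion is described in Section 4.

```python
def stopping_reason(k, err, grad_norm, weight_norm, err_history, right_history,
                    c1, c2, c3, c4, c5, K):
    """Return the number of the first stopping criterion met, or None.

    err_history[i] and right_history[i] hold the error and the number of
    correctly classified training patterns after iteration i.
    """
    if k >= c1:
        return 1                                      # too many iterations
    if err <= c2:
        return 2                                      # achieved error goal
    if grad_norm <= c3 * weight_norm:
        return 3                                      # achieved gradient goal
    if k >= K and k % K == 0:                         # checked every K iterations
        if err >= (1.0 - c4) * err_history[k - K]:
            return 4                                  # error decreasing too slowly
        if right_history[k] - right_history[k - K] <= c5:
            return 5                                  # too few new right answers
    return None

errs = [1.0 / (i + 1) for i in range(21)]             # steadily decreasing error
rights = [1900 + i // 10 for i in range(21)]          # one more right per 10 steps
early = stopping_reason(20, errs[20], 1.0, 1.0, errs, rights,
                        c1=1000, c2=0.0, c3=0.0, c4=0.0, c5=1, K=10)
late = stopping_reason(20, errs[20], 1.0, 1.0, errs, rights,
                       c1=1000, c2=0.0, c3=0.0, c4=0.0, c5=-100, K=10)
print(early, late)
```

Setting a constant to zero, or to a large negative value for the fifth constant, effectively switches the corresponding test off, so one routine can serve several stopping policies.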

2.1.  Parallel  implementation 

The efficient parallel implementation of neural networks in hardware and
software is an active area of current research. The motivation is pragmatic; faster
training algorithms and their implementation allow larger, more complicated
networks to be evaluated. The parallelism inherent in artificial neural nets is well
known. In particular, the multilayer perceptrons trained by BP and SCG are
both readily made parallel. Their implementation is particularly suited to
massively parallel fine-grained SIMD architectures. Machines of this type, such as
the AMT DAP or the Loral Industries MPP, are typified by tightly coupled
processors connected by a high-bandwidth bus.


2.1.1.  Forward  propagation 

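The multilayer perceptron networks require that pattern vectors be propagated
forward through successive layers of neurons. Define activation matrices
$\Phi^{(0)}$, $\Phi^{(1)}$, and $\Phi^{(2)}$ whose columns are the activation vectors for each pattern. The
parallel equations corresponding to Equation 1 are

$$
\Phi^{(1)} = f\bigl(\mathbf{w}^{(1)}\Phi^{(0)}\bigr) \quad\text{and}\quad \Phi^{(2)} = f\bigl(\mathbf{w}^{(2)}\Phi^{(1)}\bigr). \tag{11}
$$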


The  activations  of  neurons  within  a layer  are  independent  and  their  parallel 
calculation  is  thus  suggested.  The  problem  is  then  one  of  matrix  multiplication. 

In the serial computer the choice of the matrix multiplication algorithm used
is largely irrelevant, since all operations are scalar. Although the literature on
parallel matrix multiplication is vast, the outer product method on the array
processor is found to be superior if the matrices are much larger than the
number of processors. The following pseudocode multiplies two matrices, m1
and m2, obtaining the product m3. The outer product method is well suited to
machines such as the AMT DAP which have hardware broadcasting instructions.

Footnote: Certain commercial equipment is identified in order to adequately
specify or describe the subject matter of this work. In no case does such
identification imply recommendation or endorsement by the National Institute of
Standards and Technology, nor does it imply that the equipment identified is
necessarily the best available for the purpose.


    real m1(L, M)                  ! L rows x M columns matrix
    real m2(M, N)                  ! M rows x N columns matrix
    real m3(L, N)                  ! L rows x N columns product

    real acol(L), arow(N)          ! rows and columns of m1 and m2
    real row_replicas(L, N)        ! arow is broadcast down this matrix
    real col_replicas(L, N)        ! acol is broadcast across this matrix

    m3 = 0.0
    do i = 1, M
    {
        acol = m1(,i)                               ! col i of m1
        arow = m2(i,)                               ! row i of m2
        row_replicas = row_broadcast(arow, L)       ! all rows the same
        col_replicas = col_broadcast(acol, N)       ! all cols the same
        m3 = m3 + row_replicas * col_replicas
    }

The expensive operations here are the parallel multiplication and the subsequent
parallel accumulation. The more naive inner product algorithm computes dot
products thus:

    real m1(L, M)                  ! L rows x M columns matrix
    real m2(M, N)                  ! M rows x N columns matrix
    real m3(L, N)                  ! L rows x N columns result
    real arow(M)                   ! a row of m1
    real col_replicas(M, N)        ! arow is broadcast across this matrix

    do i = 1, L
    {
        arow = m1(i,)                                   ! row i of m1
        col_replicas = col_broadcast(arow, N)           ! all cols the same
        m3(i,) = sum_over_rows(m2 * col_replicas)       ! sum down each column
    }

Footnote: An array processor is considered here to be a finite two-dimensional
rectangular array of SIMD processing elements. Operations on matrices larger
than this array are looped transparently by the machine.

The  parallel  product  is  inexpensive;  the  inefficiency  of  the  method  is  determined 
by  the  speed  of  interprocessor  communication  necessary  for  the  term  collection 
in  the  “sum  over  rows.”  The  outer  product  algorithm  is  not  dependent  on  this 
bandwidth  and  is  therefore  faster. 
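The two multiplication schemes can be compared in a serial emulation. In the sketch below the inner double loop of `matmul_outer` stands in for what an array processor would do in one parallel multiply-and-accumulate; the function names are hypothetical, and plain Python lists are used only to keep the example self-contained.

```python
def matmul_outer(m1, m2):
    """Accumulate M rank-one (outer product) updates, one per column of m1."""
    L, M, N = len(m1), len(m2), len(m2[0])
    m3 = [[0.0] * N for _ in range(L)]
    for i in range(M):
        acol = [m1[r][i] for r in range(L)]       # column i of m1
        arow = m2[i]                              # row i of m2
        for r in range(L):                        # on a SIMD array all L x N
            for c in range(N):                    # products happen at once
                m3[r][c] += acol[r] * arow[c]
    return m3

def matmul_inner(m1, m2):
    """Compute each element of the product as a separate dot product."""
    L, M, N = len(m1), len(m2), len(m2[0])
    return [[sum(m1[r][k] * m2[k][c] for k in range(M)) for c in range(N)]
            for r in range(L)]

m1 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
m2 = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
outer_result = matmul_outer(m1, m2)
inner_result = matmul_inner(m1, m2)
print(outer_result, inner_result)
```

Both routines give the same product; the difference on a parallel machine lies in whether the summation needs communication between processors (inner product) or only broadcasts followed by local accumulation (outer product).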

2.1.2.  Weight  Modification 

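Forward propagation is necessary in training and in classification. During
training the weight update is dependent on the error gradient. The error, $\mathbf{E}$, is given
as the difference between the target output and the actual output thus:

$$
\mathbf{E} = \Psi - \Phi^{(2)}. \tag{12}
$$

The output layer error is weighted by the first derivative of the squashed output
activation

$$
\Delta^{(2)} = \mathbf{E} \odot f'\bigl(\mathbf{w}^{(2)}\Phi^{(1)}\bigr), \tag{13}
$$

where $\odot$ denotes parallel component-by-component multiplication. The hidden
layer error is obtained by passing the output deltas back through the final weight
layer.

$$
\Delta^{(1)} = \bigl(\mathbf{w}^{(2)T}\Delta^{(2)}\bigr) \odot f'\bigl(\mathbf{w}^{(1)}\Phi^{(0)}\bigr). \tag{14}
$$

The negative gradients of the error with respect to the individual weights are,
up to the constant factor $1/(N^{(2)}N^{(P)})$ of Equation 5,

$$
\mathbf{g}^{(1)} = \Delta^{(1)}\Phi^{(0)T} \quad\text{and}\quad \mathbf{g}^{(2)} = \Delta^{(2)}\Phi^{(1)T}. \tag{15}
$$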

The weights are independent and therefore the update operations are
homogeneous and purely parallel. The weight update time is negligible compared to
doing the forward propagation and gradient calculation.

3.  The  handwritten  characters  database 

The data set used for the examples in this paper is known as FL3; it contains
features extracted from handprinted digits. There are 2000 patterns in the
training set and 1434 in the testing set.

The initial character images were obtained by isolating and segmenting full page
images from the NIST Handprinted Characters Database; segmentation is
described in Wilkinson. The resulting character images were centered in a 32
x 32 array of pixels. Each character image then had a shear transformation and a
Gabor feature extraction applied, reducing each 1024-pixel binary image to a
32-feature pattern.

3.1.  Shear  Transformation 

The purpose of the shear transform is to remove the slant from the character
images. Pixel histograms are calculated for the top and bottom few rows of the
image. The centers of these two distributions determine a virtual line between
them, which is used to construct the edges of a parallelogram around the
character. Each row of the image is shifted horizontally to bring the edges of the
parallelogram into vertical alignment.
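A much simplified version of such a slant correction is sketched below. It is not the procedure used to build FL3: it takes the mean ink position in a band of rows at the top and at the bottom as the "centers", assumes the slant is a straight line between them, and shifts each row by a whole number of pixels. The function name `shear`, the `band` parameter, and the rounding rule are all assumptions made for illustration.

```python
def shear(image, band=2):
    """Shift rows so the line joining the top and bottom ink centers is vertical."""
    rows, cols = len(image), len(image[0])

    def center(band_rows):
        xs = [x for row in band_rows for x, pixel in enumerate(row) if pixel]
        return sum(xs) / len(xs) if xs else (cols - 1) / 2.0

    top = center(image[:band])
    bottom = center(image[rows - band:])
    slope = (bottom - top) / max(1, rows - band)      # horizontal drift per row
    middle = (rows - 1) / 2.0
    out = []
    for y, row in enumerate(image):
        shift = int(round(slope * (middle - y)))      # rows above the middle move right
        new_row = [0] * cols
        for x, pixel in enumerate(row):
            if pixel and 0 <= x + shift < cols:
                new_row[x + shift] = 1
        out.append(new_row)
    return out

# A 5 x 7 image holding a single diagonal stroke, one pixel per row.
image = [[1 if x == y + 1 else 0 for x in range(7)] for y in range(5)]
straightened = shear(image)
columns = [row.index(1) for row in straightened]
print(columns)
```

After the shift every pixel of the diagonal stroke lies in the same column, which is the effect the shear transform is meant to have on a slanted character.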

3.2.  Gabor  Feature  Extraction 

In classifying images, each 32 pixels by 32 pixels in size, to decide which character
is represented in each image, a neural network would need 1024 input values,
one for each pixel, if the image pixel values were fed directly to the network.
This network is larger than necessary. To get a smaller network, the essence
of each image must somehow be distilled into a smaller group of numbers, each
describing some feature, with the value of each feature used as one of the input
values.


Various features are possible. Recently the Gabor functions have been
proposed as a model of mammalian visual receptive fields and as such are a
biologically based method of extracting salient features from characters. Such
features have been used with considerable success in machine print OCR and
handprint OCR.


3.2.1.  Gabor  Filtering 


Gabor functions come in even and odd variants:

$$
\phi_{\mathrm{even}}(x,y) = e^{-r^2/\sigma^2}\cos(\omega x'), \qquad \phi_{\mathrm{odd}}(x,y) = e^{-r^2/\sigma^2}\sin(\omega x'), \tag{16}
$$

where $r^2 = x'^2 + y'^2$ and $\sigma$ and $\omega$ are parameters. The exponential term is a
Gaussian envelope that localises the feature extractor. Coordinates $x'$ and $y'$
are transforms of $x$ and $y$, shifted and rotated by angle $\theta$:

$$
x' = (x - x_0)\cos\theta + (y - y_0)\sin\theta, \qquad y' = -(x - x_0)\sin\theta + (y - y_0)\cos\theta. \tag{17}
$$

Note that $y'$ is absent from the sinusoidal factor of Equation 16, so that the
function represents a two-dimensional sinusoid, offset by $(x_0, y_0)$, localised in
space by the Gaussian, and oriented at angle $\theta$.

Thirty-two such functions were used; the set of the five parameter values was
chosen empirically on the grounds of efficacy and is as follows. Assume the $N
\times N$ image has its origin at its center (0,0). Use identical Gabor functions in
each of the four quadrants centered at positions $(\pm N/4, \pm N/4)$. Within each
quadrant, use four orientations at angles 0, $\pi/4$, $\pi/2$, and $3\pi/4$ radians. Use
one spatial frequency, $\omega = 2\pi\sqrt{2}/N$, and one Gaussian envelope radius, $\sigma =
N\sqrt{2}/8$. Use both even (cosine) and odd (sine) functions.

Approximate the image by a linear combination of the 32 Gabor functions;
obtain the coefficients by a least-squares fit. These 32 coefficients are the extracted
features in the FL3 data set; they are the inputs to the multilayer perceptron
networks in the next section.
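The parameter set described above (four quadrant centers, four orientations, even and odd variants) can be enumerated directly. The sketch below builds the 32 parameter tuples and evaluates one function; it does not perform the least-squares fit. The Gaussian envelope is assumed to have the form exp(-r^2 / sigma^2) and the rotation is assumed to be counter-clockwise, since neither detail can be confirmed from the text. The names `gabor_bank` and `gabor` are hypothetical.

```python
import math

def gabor_bank(n):
    """Return 32 (x0, y0, theta, omega, sigma, kind) tuples for an n x n image."""
    omega = 2.0 * math.pi * math.sqrt(2.0) / n        # one spatial frequency
    sigma = n * math.sqrt(2.0) / 8.0                  # one envelope radius
    bank = []
    for x0 in (-n / 4.0, n / 4.0):                    # four quadrant centers
        for y0 in (-n / 4.0, n / 4.0):
            for theta in (0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4):
                for kind in ("even", "odd"):
                    bank.append((x0, y0, theta, omega, sigma, kind))
    return bank

def gabor(x, y, params):
    """Evaluate one Gabor function at (x, y), with the origin at the image center."""
    x0, y0, theta, omega, sigma, kind = params
    xr = (x - x0) * math.cos(theta) + (y - y0) * math.sin(theta)
    r2 = (x - x0) ** 2 + (y - y0) ** 2                # rotation leaves r^2 unchanged
    wave = math.cos(omega * xr) if kind == "even" else math.sin(omega * xr)
    return math.exp(-r2 / sigma ** 2) * wave          # assumed envelope form

bank = gabor_bank(32)
even_at_center = gabor(-8.0, -8.0, bank[0])
odd_at_center = gabor(-8.0, -8.0, bank[1])
print(len(bank), even_at_center, odd_at_center)
```

At its own center an even function takes its peak value and an odd function is zero, which is a quick sanity check on any implementation.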


4.  Results 

The initial weights were chosen from a uniform random distribution in the range
(−0.5, +0.5). Since different minima are found with different starting weights,
a statistical sampling of runs is necessary, each run with a different seed to
start the random number generator. Each run finds some local minimum, but
the particular one found is a sensitive function of the initial choice of weights.
In fact, two different brands of computers will ordinarily not find the same
local minimum, even given the same initial choice of weights; extremely minor
differences in doing arithmetic are sufficient, over many iterations, to cause
different local minima to be found.

All the runs use the same classification rules to divide the results into three
categories: Right, Unknown, and Wrong. For each pattern, the highest of the 10
output values is found. If the activation level is below a set activation threshold,
the result is defined to be Unknown. If the activation level is at least as large as
the threshold, the result is accepted, and the answer is either Right or Wrong.
Except for Figure 3, all results reported here use a zero activation threshold,
eliminating Unknowns.
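The three-way classification rule is simple enough to state as code. The sketch below follows the rule as described; the function name `classify` and the example activations are made up for illustration.

```python
def classify(outputs, true_digit, threshold=0.0):
    """Label one pattern Right, Wrong, or Unknown from its output activations."""
    best = max(range(len(outputs)), key=lambda j: outputs[j])
    if outputs[best] < threshold:
        return "Unknown"                  # highest activation is too weak
    return "Right" if best == true_digit else "Wrong"

outputs = [0.02, 0.05, 0.10, 0.61, 0.03, 0.01, 0.08, 0.30, 0.04, 0.02]
print(classify(outputs, 3), classify(outputs, 7), classify(outputs, 3, threshold=0.9))
```

With a zero threshold every pattern is accepted; raising the threshold trades some answers for Unknowns.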

For both BP and SCG, by far the most time-consuming part of the calculation is
done by the forward subroutine, which feeds the inputs forward to the outputs
according to Equation 1, then feeds outputs backwards according to Equation 7
to produce the gradient of the error function (step 4 of the algorithm in Section
1.1). BP does this once per iteration; SCG usually does this twice per iteration,
but occasionally more than twice. For comparing BP and SCG, we therefore
compare the number of calls to forward rather than the number of iterations.

Comparisons of algorithms depend strongly on the stopping criteria described
in Section 2. The runs presented here use one of two sets of stopping criteria.
The early set uses criterion 5, stopping when the number of correctly classified
training patterns increases too slowly. Criteria 2, 3, and 4 are disabled by setting
$C_2$, $C_3$, and $C_4$ to zero. (Criterion 4 is not quite disabled by $C_4 = 0$; if a local
minimum is found, the error cannot be decreased further.) Good values for SCG
are $C_5 = 1$ and $K = 10$; $C_1 = 1000$ is safe, since at most a few hundred iterations
are necessary. For BP it is necessary to use a larger $K$, such as $K = 100$, and to
allow a larger number of iterations. The early set strikes a reasonable balance
between the level of convergence obtained and the computer time used.

The late set emphasizes the level of convergence at the expense of computer
time; it uses criterion 1, stopping when the maximum number of iterations has
been used. Criteria 2, 3, and 4 are disabled as in the early set, with zero values
for $C_2$, $C_3$, and $C_4$. Criterion 5 is disabled by $C_5 = -100$. Reasonable values for
SCG are $K = 10$ and $C_1 = 1000$. BP needs $K = 100$ and $C_1 = 10000$ or more.

In both sets, instead of $C_3 = 0$, a value like $C_3$ = 10“^^ may be used for quicker
stopping in case of accidentally finding a local minimum quickly.

4.1.  Training  on  the  training  set 

Training on the 2000-pattern “training set,” followed by testing on the
1434-pattern “testing set,” gives a realistic picture of what can be expected. Note
that, for a practical application, only the best set of weights found is important,
since any inferior sets will be discarded.

Table 1 illustrates the effect of changing the number of hidden nodes in the
network. Eight hidden nodes are clearly insufficient, but after 24 hidden nodes
the improvement is small. In general, as the number of hidden neurons increases,
training and testing improve somewhat, training is more likely to find one of the
poorer local minima, and training takes more calls to forward.

Choosing the Levenberg-Marquardt parameter, $\mu$, must be done experimentally;
$\mu$ = 10“^ is reasonable; this value does not minimize the $\mathcal{E}$ error, but it gives
better classification results on both the training and the testing sets.

Figures 1 (early stopping criteria) and 2 (late stopping criteria, $C_1 = 1000$)
illustrate the variability of results. Each figure shows results from computer
runs starting from 50 different random seeds for $\mu = 0$ and for the nonzero value
of $\mu$. The activation threshold is zero, so that there are no Unknowns. Note that
the nonzero $\mu$ is slightly better than $\mu = 0$ and that late stopping gives slightly
better results than early stopping, although at a significant penalty in computer
time; the median number of calls to forward is 303 for early stopping and 2002
for late stopping. Also note that there is little correlation between the
performances on the training set and on the testing set.

Typical results for percent correct on the testing set as a function of the number
of Unknowns are shown in Figure 3. For each run, the activation threshold
was successively changed to generate the desired percent of Unknowns; then the
remainder of the test pattern results were classified as Right or Wrong. For each
value of $N^{(1)}$, ten random starts were done and the run with the best number
correct at zero activation threshold was chosen. Note that little improvement
is seen for $N^{(1)} > 16$.

4.2.  Training  on  the  full  set 

To  demonstrate  that  it  is  possible  to  have  a set  of  weights  that  does  equally  well 
on  both  sets,  some  runs  trained  on  all  3434  patterns.  As  an  example,  with  24 
hidden  nodes,  training  produced  a set  of  weights  which  gave  98.9%  right  on  the 
full  set  and  99.1%  on  the  testing  set. 

4.3. Comparison with backpropagation

In order to compare CG and BP, reasonable values of $\eta$ and $\alpha$ must be found.
Because of the $1/(N^{(2)}N^{(P)})$ in the definition of $\mathcal{E}(\mathbf{w};\mu)$, our $\eta$ is larger by a factor of
$N^{(2)}N^{(P)}$ than the usual one, and is to first order independent of $N^{(P)}$. Table 2 gives
the training error after 500 iterations of BP using the first 500 patterns of the
training set, all with the same random number seed. Note that 500 iterations
(501 calls to forward) are insufficient to train the 32-16-10 network.

From this survey and similar ones, reasonable values are $\eta = 10$ and $\alpha = 0.25$.
(No claim is made that these are ideal values.)

Figures 4 and 5 compare single runs of BP and SCG for a 32-24-10 network,
with late stopping and $\mu$ = 10“^, with $K = 10$ for SCG and $K = 100$ for BP.
Note that any stopping criterion based on slowness of decrease of the error or of
the number wrongly classified is likely to stop well before the lowest values are
obtained.


5.  Conclusions 

SCG  is  faster  than  BP.  An  exact  comparison  is  hard  to  obtain,  but  SCG  appears 
to  be  about  10  times  faster  than  BP. 

FL3 is a reasonable data set to use, but is not a large enough set for training a
practical network for classifying handprinted digits. If it were large enough, the
recognition rate on the testing set would not be appreciably lower than the rate
on the training set.


Figure captions


Figure 1. Scatter plot of testing results vs. training results for 32-24-10
networks, early stopping. Open circles: $\mu = 0$; filled circles: $\mu$ = 10“^.

Figure 2. Scatter plot of testing results vs. training results for 32-24-10
networks, late stopping. Open circles: $\mu = 0$; filled circles: $\mu$ = 10“^.

Figure 3. Percent correct as a function of the percent Unknown. 32-$N^{(1)}$-10
networks, early stopping, $\mu$ = 10“^. Filled circles: $N^{(1)} = 8$; filled triangles:
$N^{(1)} = 16$; open circles: $N^{(1)} = 24$; open triangles: $N^{(1)} = 32$.

Figure 4. Error vs. number of function calls for a 32-24-10 network, late
stopping, $\mu$ = 10“^. Filled circles: SCG; open circles: BP.

Figure 5. Percent wrong vs. number of function calls for a 32-24-10 network,
late stopping, $\mu$ = 10“^. Filled circles: SCG; open circles: BP.


Table 1: Results of training 32-N^(1)-10 feed-forward networks with SCG,
μ = 10“^, and early stopping. Tabular values are the number of calls to forward,
the percent correct on the training set, and the percent correct on the testing
set, for the best run, selected from 10 runs with random starts.

N^(1)    Calls    Training    Testing
  8       321      94.9%       93.0%
 16       321      98.1%       95.1%
 24       322      98.8%       95.2%
 32       422      99.3%       95.7%
 40       401      99.4%       95.3%
 48       507      99.2%       95.2%
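The gap between training and testing rates that the Conclusions point to can be read directly from Table 1. A minimal sketch follows, with the table values copied in; the names `TABLE_1` and `accuracy_gaps` are introduced here for illustration only.

```python
# (hidden nodes, percent correct on training set, percent correct on testing set),
# copied from Table 1.
TABLE_1 = [
    (8, 94.9, 93.0),
    (16, 98.1, 95.1),
    (24, 98.8, 95.2),
    (32, 99.3, 95.7),
    (40, 99.4, 95.3),
    (48, 99.2, 95.2),
]


def accuracy_gaps(rows):
    """Map hidden-layer size to (training - testing) in percentage points."""
    return {hidden: round(train - test, 1) for hidden, train, test in rows}


if __name__ == "__main__":
    for hidden, gap in accuracy_gaps(TABLE_1).items():
        print(f"{hidden:2d} hidden nodes: gap {gap:.1f} points")
```

The gap is 1.9 points for 8 hidden nodes and between 3.0 and 4.1 points for the larger networks, so the testing rate is appreciably lower than the training rate in every row.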

Table 2: Results of training a 32-16-10 feed-forward network with 500 iterations
of backpropagation on the first 500 patterns of the training set with seed 12345
and various values for η and α. The tabular value is the error.

          α = 0.10    0.25     0.50     0.75     0.90
η =  1      0.269     0.263    0.247    0.203    0.260
η =  2      0.240     0.229    0.196    0.162    0.273
η =  5      0.166     0.153    0.131    0.238    0.285
η = 10      0.124     0.116    0.196    0.285    0.316
η = 20      0.197     0.222    0.285    0.316    0.316

FIG. 1. [Scatter plot; x-axis: Training: Percent Correct (97.0 to 100); y-axis: Testing: Percent Correct.]

FIG. 2. [Scatter plot; x-axis: Training: Percent Correct (97.0 to 100); y-axis: Testing: Percent Correct.]

FIG. 3. [Plot; x-axis: Percent Unknown.]

FIG. 4. [Plot; x-axis: Function Calls; y-axis: Error (relative).]

FIG. 5. [Plot; x-axis: Function Calls; y-axis: Training: Percent Wrong.]


NIST-114A
(REV. 3-90)

U.S. DEPARTMENT OF COMMERCE
NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

BIBLIOGRAPHIC DATA SHEET

1. PUBLICATION OR REPORT NUMBER
   NISTIR 4776

2. PERFORMING ORGANIZATION REPORT NUMBER

3. PUBLICATION DATE
   FEBRUARY 1992

4. TITLE AND SUBTITLE
   Training Feed Forward Neural Networks Using Conjugate Gradients

5. AUTHOR(S)
   Patrick J. Grother and James L. Blue

6. PERFORMING ORGANIZATION (IF JOINT OR OTHER THAN NIST, SEE INSTRUCTIONS)
   U.S. DEPARTMENT OF COMMERCE
   NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY
   GAITHERSBURG, MD 20899

7. CONTRACT/GRANT NUMBER

8. TYPE OF REPORT AND PERIOD COVERED

9. SPONSORING ORGANIZATION NAME AND COMPLETE ADDRESS (STREET, CITY, STATE, ZIP)

10. SUPPLEMENTARY NOTES

11. ABSTRACT (A 200-WORD OR LESS FACTUAL SUMMARY OF MOST SIGNIFICANT INFORMATION. IF DOCUMENT INCLUDES A BIBLIOGRAPHY OR LITERATURE SURVEY, MENTION IT HERE.)

Neural networks for optical character recognition are still being trained using back
propagation, even though conjugate gradient methods have been shown to be much faster.
Most multilayer perceptron network training results in the literature are obtained for
small and unrealistic problems or from data sets that are proprietary and not available
for comparison testing. We present results on a large realistic pattern set containing
2000 training and 1434 testing exemplars. Each pattern is composed of 32 Gabor
coefficients obtained from a 32 by 32 pixel binary image of a hand-written digit
segmented from the NIST Handwriting Image Data Base. These sets are believed to have
approximately 1% segmentation errors. Comparative results for Møller's scaled conjugate
gradient method and for standard back propagation are presented for runs on a serial
scientific workstation and a highly parallel computer. Typical training on a network
with 32 inputs, 32 hidden nodes, and 10 output nodes gives a 98% to 99% recognition on
the training set and 95% for the test set. Training with conjugate gradients requires
fewer than 200 iterations; times are about 20 to 40 minutes on a scientific workstation
and 6 minutes on the highly parallel computer. Testing (classification) is done at a
rate of 600 and 1600 patterns per second on the scientific workstation and the highly
parallel computer respectively. These results suggest that commercial hand-written
character recognition systems with great economic potential are feasible.

12. KEY WORDS

multilayer perceptrons; scaled conjugate gradients; OCR; parallel neural networks

